Probabilistic prototype models for attributed graphs

S. Deepak Srinivasan, Klaus Obermayer 

Neural Information Processing Group, Technische Universitaet Berlin 
Sekretariat FR 2-1, Franklinstr. 28/29, 10587 Berlin 
{deepak,oby}@cs.tu-berlin.de 

Abstract 

This contribution proposes a new approach to developing a class of
probabilistic methods for classifying attributed graphs. The key concept is
the random attributed graph, defined as an attributed graph whose nodes and
edges are annotated by random variables. Every node and edge has two random
processes associated with it: its occurrence probability and the probability
distribution over its attribute values. These are estimated within the
maximum likelihood framework. The likelihood that a random attributed graph
generates an outcome graph is used as a feature for classification. The
proposed approach is fast and robust to noise.

1 Introduction 



Attributed graphs are used to represent data as diverse as images, shapes,
molecules and protein structures. The statistical analysis of a dataset of
patterns represented by graphical structures is a challenging problem and is
closely related to tasks such as density estimation, mixture modelling,
classification and clustering. There have been some efforts to develop
probabilistic models for attributed graphs in the context of pattern
recognition. Wong et al. [1, 2] propose the concept of a random graph, which
takes into account structural and contextual probabilities. An instantiation
(outcome) of a random graph is an attributed graph, which enables the
characterization of an ensemble of outcome graphs with a probability
distribution. Sole-Ribalta et al. [3] generalize the idea of random graphs to
structurally described random graphs (SDRG), with node and edge value
distributions. Algorithms have been proposed in which random attributed graph
models are used for classification in the maximum likelihood framework. This
framework has also been adopted by Seong et al. [4] to develop an incremental
clustering algorithm for attributed graphs, and by Sengupta et al. [5] to
efficiently organize large structural modelbases for quick retrieval. Two
features of such a definition are noteworthy: (i) both the structural and the
contextual probabilities are considered, and are estimated under suitable
independence assumptions; and (ii) a wide variety of attribute values can be
handled.

The present contribution aims to develop a theory for probabilistic modeling
of attributed graphs, similar to generative models for feature vectors, and to
demonstrate its utility for classifying graph patterns. We propose random
graph models as prototypes for a set of graphs with continuous node and edge
attribute vectors, and estimate their parameters. Instead of using the random
graph models to classify the patterns by maximum likelihood, we use the
likelihood values as features for classification by subsequent discriminative
classifiers such as support vector machines.



2 Random attributed graph models 
2.1 Definitions 

A random attributed graph (referred to simply as a random graph) is a graph
whose nodes and edges are finite probability distributions. Each outcome of a
random graph is a labeled graph together with a morphism of the labeled graph
into the random graph. The morphism specifies, for each vertex (or edge) of
the outcome graph, which vertex (or edge) of the random graph generated it.
The probability space of a random graph should be complete and such that its
outcomes are attributed graphs with a specified morphism relation. The
definitions in this section follow [1] closely.

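Technically, the random graph $\mathfrak{G} = (\mathfrak{V}, \mathfrak{E})$ is
defined to be such that:

1. Each vertex $\mathfrak{v} \in \mathfrak{V}$ and edge
   $\mathfrak{e} \in \mathfrak{E}$ is a finite probability distribution.

2. $\forall \mathfrak{e} \in \mathfrak{E}: \; p(\mathfrak{e} = \phi \mid \sigma(\mathfrak{e}) = \phi) = 1$

3. The space of the joint distribution of all random nodes and random edges is
   complete.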

Condition 2 ensures that an edge can occur in an outcome only if both its
ends (terminal nodes, given by $\sigma(\mathfrak{e})$) occur. Completeness
means that the space is indeed a (standard) probability space. Consider the
probability space of the joint distribution: it is the probability space of
attributed graphs, and every outcome is an attributed graph.

Footnote: Elements of the random attributed graph are written in Fraktur script.

Figure 1: A random attributed graph (centre) with two outcomes and their
respective likelihoods



Let $G = (V, E)$ be an outcome graph. The morphisms $\mu: V \to \mathfrak{V}$
and $\nu: E \to \mathfrak{E}$ specify the structural mapping between the
outcome and the random graph. Thus, an outcome of a random graph is specified
by the tuple $(V, E, \gamma)$, where $\gamma = (\mu, \nu)$. Note that the
mappings $\mu$ and $\nu$ are into, and that the inverse mappings $\mu'$ and
$\nu'$ may map some elements to the null element $\phi$, i.e.
$\mu'(\mathfrak{v}) = \phi$ if no morphism exists. The probability of an
outcome graph is then the probability of its joint outcome, described by



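$$
P_{\mathfrak{G}}(G,\gamma) = \mathrm{prob}\{(v = \mu'(\mathfrak{v}),\ \forall \mathfrak{v} \in \mathfrak{V},\ \alpha(v) = \alpha_i),\ (e = \nu'(\mathfrak{e}),\ \forall \mathfrak{e} \in \mathfrak{E},\ \beta(e) = \beta_i)\}
\tag{1}
$$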

where $\alpha(v)$ is the node attribute function, which assigns an attribute
to every node, and $\alpha_i$ denotes a particular node attribute value;
$\beta(e)$ and $\beta_i$ are the corresponding edge attribute function and
values, respectively.

We make the following assumptions to make the definition computationally
feasible: node occurrences are independent, and edge occurrences depend only
on the nodes to which the edge is incident.

Then Eq. (1) simplifies to



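$$
P_{\mathfrak{G}}(G,\gamma) =
\prod_{\mu'(\mathfrak{v}) \neq \phi} p(\mathfrak{v})
\prod_{\mu'(\mathfrak{v}) = \phi} q(\mathfrak{v})
\prod_{\nu'(\mathfrak{e}) \neq \phi} p(\mathfrak{e})
\prod_{\nu'(\mathfrak{e}) = \phi} q(\mathfrak{e})
\prod_{\mu'(\mathfrak{v}) \neq \phi} \mathrm{prob}(\alpha(\mu'(\mathfrak{v})) = \alpha_i)
\prod_{\nu'(\mathfrak{e}) \neq \phi} \mathrm{prob}(\beta(\nu'(\mathfrak{e})) = \beta_i)
\tag{2}
$$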

where $p(\mathfrak{v})$ denotes the probability that the node occurs and
$q(\mathfrak{v}) = 1 - p(\mathfrak{v})$ is the probability that it does not
occur. Similar notation is adopted for the edges. Eq. (2) decomposes the
probability of an attributed graph instance into a product of three kinds of
terms: the occurrence probabilities of those nodes and edges of the generating
random graph that appear in the outcome, the non-occurrence probabilities of
the nodes and edges that are absent from the outcome, and the probabilities
that the occurring nodes and edges assume their respective attribute values.
Figure 1 illustrates the above definitions with an example.
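
To illustrate how Eq. (2) can be evaluated once a morphism is fixed, the
following minimal Python sketch accumulates the logarithm of each factor. It
is not the authors' implementation: the dictionary layout, the names `clip`,
`gaussian_logpdf` and `log_likelihood`, the scalar Gaussian attribute model,
the clipping constant `EPS`, and the reading of edge probabilities as
conditional on both end nodes being present are all assumptions made for this
example.

```python
import math

EPS = 1e-12  # assumption: clip probabilities so that the logarithms stay finite


def clip(p):
    return min(max(p, EPS), 1.0 - EPS)


def gaussian_logpdf(x, mean, var):
    """Log-density of a scalar Gaussian (illustrative attribute model)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_likelihood(rg_nodes, rg_edges, out_nodes, out_edges, mu):
    """Logarithm of Eq. (2) for one outcome graph under a fixed node morphism.

    rg_nodes:  {random_node: (p_occurrence, mean, var)}
    rg_edges:  {(random_u, random_v): (p_occurrence_given_ends, mean, var)}
    out_nodes: {outcome_node: attribute}
    out_edges: {(outcome_u, outcome_v): attribute}, treated as undirected
    mu:        {outcome_node: random_node}, assumed injective
    """
    inv = {r: o for o, r in mu.items()}  # mu': random node -> outcome node
    ll = 0.0
    for r, (p, mean, var) in rg_nodes.items():
        p = clip(p)
        if r in inv:   # node occurs: occurrence term times attribute density
            ll += math.log(p) + gaussian_logpdf(out_nodes[inv[r]], mean, var)
        else:          # node absent: q = 1 - p
            ll += math.log(1.0 - p)
    for (ru, rv), (p, mean, var) in rg_edges.items():
        if ru not in inv or rv not in inv:
            continue   # condition 2: the edge is absent with probability 1
        p = clip(p)
        u, v = inv[ru], inv[rv]
        attr = out_edges.get((u, v), out_edges.get((v, u)))
        if attr is not None:
            ll += math.log(p) + gaussian_logpdf(attr, mean, var)
        else:
            ll += math.log(1.0 - p)
    return ll


# Hypothetical two-node random graph and an outcome containing only node "a".
rg_nodes = {"a": (0.9, 0.0, 1.0), "b": (0.5, 1.0, 1.0)}
rg_edges = {("a", "b"): (0.8, 0.0, 1.0)}
ll_single = log_likelihood(rg_nodes, rg_edges, {1: 0.0}, {}, {1: "a"})
```

The sketch assumes that every outcome node and edge maps into the random
graph; outcome elements without a counterpart are simply ignored here.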

2.2 Model based clustering of attributed graphs 

The estimation of the structural parameters of a random graph given a dataset
follows from maximizing the likelihood: the node and edge occurrence
probabilities of the random graph are set to those values which maximize the
likelihood of the dataset being generated from the random graph. The cost
function is



$$
\sum_{i} \ln p(W(G_i))
\tag{3}
$$

where $\ln p(W(G_i))$ is the log-likelihood that the random graph $W$
generates the graph $G_i$. We now consider the case where the node and edge
attributes are given by feature vectors. In order to simplify the analytical
treatment, we assume the attribute vectors to be generated by Gaussian
distributions whose means and covariances are to be determined.

Initially, we maximize the cost function with respect to the node and edge
occurrence probabilities $p(\mathfrak{v})$ and $p(\mathfrak{e})$. As the node
occurrences are modelled by independent Bernoulli distributions, the maximum
likelihood estimate is the fraction of the node's occurrences in the dataset,

$$
p(\mathfrak{v}) = \frac{n_{\mathfrak{v}}}{N}
\tag{4}
$$

where $n_{\mathfrak{v}}$ is the number of occurrences of the node in the
sample set and $N$ is the number of graphs in the sample set. Similar
estimates hold for the edges, except that the edge occurrence probabilities
are normalized by their respective node probabilities (accounting for the fact
that an edge cannot occur if either of its end nodes does not occur).
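
The counting estimates described above can be sketched as follows. This is a
minimal illustration under stated assumptions: the graphs are taken to be
already aligned with the random graph (each sample is given as the sets of
random-graph nodes and edges it realises), the function name
`estimate_occurrence` is hypothetical, and the edge normalisation is read as
the conditional frequency of the edge among samples in which both end nodes
occur, which is one possible interpretation of the text.

```python
from collections import Counter


def estimate_occurrence(aligned_graphs, rg_nodes, rg_edges):
    """Maximum likelihood occurrence estimates from aligned sample graphs.

    aligned_graphs: list of (present_nodes, present_edges), both expressed in
                    random-graph identifiers
    rg_nodes:       iterable of random-graph node identifiers
    rg_edges:       iterable of (u, v) random-graph edge identifiers
    """
    n = len(aligned_graphs)
    node_count = Counter()
    edge_count = Counter()
    ends_count = Counter()  # samples in which both end nodes of an edge occur
    for nodes, edges in aligned_graphs:
        nodes, edges = set(nodes), set(edges)
        node_count.update(nodes)
        for e in rg_edges:
            u, v = e
            if u in nodes and v in nodes:
                ends_count[e] += 1
                if e in edges:
                    edge_count[e] += 1
    p_node = {v: node_count[v] / n for v in rg_nodes}
    p_edge = {e: (edge_count[e] / ends_count[e] if ends_count[e] else 0.0)
              for e in rg_edges}
    return p_node, p_edge


# Four hypothetical aligned samples of a two-node random graph.
samples = [
    ({"a", "b"}, {("a", "b")}),
    ({"a", "b"}, set()),
    ({"a"}, set()),
    ({"a", "b"}, {("a", "b")}),
]
p_node, p_edge = estimate_occurrence(samples, ["a", "b"], [("a", "b")])
```

In this toy sample, node "b" occurs in three of four graphs, and the edge
occurs in two of the three graphs in which both of its end nodes are present.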

2.3 Density estimation 

We now consider the problem of estimating the means and covariances of the
node and edge attribute distributions. It is possible to derive gradient
descent update rules for the means and covariance matrices. Vanilla gradient
descent, in which the means and covariances are updated in the direction of
the gradient (as it is assumed to be the steepest direction), is not ideal as
it ignores the geometry of the underlying probability space. Therefore, we use
natural gradient descent to estimate the means and covariances online [6, 7].

Natural gradient descent is a modification of the gradient descent procedure 
which takes into account the geometry of the manifold by incorporating a cor- 
rective term given by the Riemannian metric tensor. The equations for updating 
the means and covariances in the direction of the natural gradient are given by 

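$$
\mu_{\mathfrak{v},t+1} = \mu_{\mathfrak{v},t} + \eta\, G^{-1} \nabla_{\mu_{\mathfrak{v}}} \ln(p)
\tag{5}
$$

$$
\Sigma_{\mathfrak{v},t+1} = \Sigma_{\mathfrak{v},t} + \eta\, H^{-1} \nabla_{\Sigma_{\mathfrak{v}}} \ln(p)
\tag{6}
$$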

where $G$ and $H$ are the Riemannian metric tensors in the space of means and
covariances, respectively. The Riemannian metric tensor in the space of mean
vectors, $G$, is given by

$$
G = \Sigma_{\mathfrak{v}}^{-1}
\tag{7}
$$

Hence, the online update equation for the Gaussian means is

$$
\mu_{\mathfrak{v},t+1} = \mu_{\mathfrak{v},t} + \eta\,(x_{\mathfrak{v}} - \mu_{\mathfrak{v},t})
\tag{8}
$$
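
The step from Eq. (5) to Eq. (8) can be made explicit. The following is a
short sketch, assuming (as in Section 2.2) that the attribute vector
$x_{\mathfrak{v}}$ of node $\mathfrak{v}$ is Gaussian with mean
$\mu_{\mathfrak{v}}$ and covariance $\Sigma_{\mathfrak{v}}$, and taking the
metric tensor to be the Fisher information of the mean; the subscript
$\mathfrak{v}$ is dropped for readability.

```latex
\begin{align*}
\ln p(x) &= -\tfrac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)
            - \tfrac{1}{2}\ln\det(2\pi\Sigma) \\
\nabla_{\mu}\ln p &= \Sigma^{-1}(x-\mu) \\
G &= \mathbb{E}\!\left[(\nabla_{\mu}\ln p)(\nabla_{\mu}\ln p)^{T}\right]
   = \Sigma^{-1}\,\mathbb{E}\!\left[(x-\mu)(x-\mu)^{T}\right]\Sigma^{-1}
   = \Sigma^{-1} \\
G^{-1}\nabla_{\mu}\ln p &= \Sigma\,\Sigma^{-1}(x-\mu) = x-\mu
\end{align*}
```

Substituting the last line into Eq. (5) gives the mean update of Eq. (8), in
which the covariance no longer appears.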
The metric tensor in the space of covariance matrices is defined as

$$
H = \mathbb{E}\left[(\nabla_{\Sigma_{\mathfrak{v}}} \ln p)(\nabla_{\Sigma_{\mathfrak{v}}} \ln p)^{T}\right]
\tag{9}
$$
which after some simplification turns out to be

$$
H = \tfrac{1}{2}\, \Sigma_{\mathfrak{v}}^{-1} \otimes \Sigma_{\mathfrak{v}}^{-1}
\tag{10}
$$
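The online covariance estimation is given by a first order update rule as

$$
\Sigma_{\mathfrak{v},t+1} = (1 - \eta)\,\Sigma_{\mathfrak{v},t} + \eta\,(x_{\mathfrak{v}} - \mu_{\mathfrak{v},t})(x_{\mathfrak{v}} - \mu_{\mathfrak{v},t})^{T}
\tag{11}
$$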
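
The two online rules, Eq. (8) for the mean and Eq. (11) for the covariance,
can be sketched in a few lines of plain Python. The function name
`online_update`, the list-of-lists matrix representation and the constant
learning rate in the example are assumptions made for illustration; the
sketch processes one observed attribute vector per call.

```python
def online_update(mean, cov, x, eta):
    """One online step of Eq. (8) and Eq. (11) for a single attribute vector.

    mean: current mean vector (list of floats)
    cov:  current covariance matrix (list of lists)
    x:    attribute vector of the matched outcome node or edge
    eta:  learning rate
    Both updates use the deviation from the current (old) mean.
    """
    d = len(mean)
    diff = [x[i] - mean[i] for i in range(d)]
    new_mean = [mean[i] + eta * diff[i] for i in range(d)]
    new_cov = [[(1.0 - eta) * cov[i][j] + eta * diff[i] * diff[j]
                for j in range(d)]
               for i in range(d)]
    return new_mean, new_cov


# Toy example: start from a standard Gaussian and observe x = (2, 0).
mean1, cov1 = online_update([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0], 0.5)
```

With a decaying rate $\eta_t = 1/t$ instead of a constant one, the mean update
reduces to the running sample mean.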

3 Random graphs generating classification features


Prototype-based classification schemes are widespread in the domain of
attributed graphs [8]. The key idea is to embed the graphs into a vector space
in the following manner. Given a set of graphs $\{(X_i, y_i)\}$, we synthesize
a set of prototype graphs $W = \{(W_j, y_j)\}$, $j = 1, \ldots, k$, such that
every graph $X_i$ is embedded in $\mathbb{R}^k$ as

$$
X_i \mapsto (d(X_i, W_1), \ldots, d(X_i, W_k))
\tag{12}
$$

where $d(X, W)$ is a dissimilarity measure between graphs and prototypes. The
choice of prototypes influences the distance measure and hence the
dissimilarity space. To illustrate, when the prototype graphs are chosen to be
set medians, means or cluster centres, it is clear how the distance is
calculated. However, what is a suitable distance measure when we choose random
graphs as prototypes?

The key lies in defining the Kullback-Leibler divergence between the
probability density of the random prototype graph $\mathfrak{W}$ and the true
(hidden) probability distribution $q$ [9, 10]:

$$
KL(q \,\|\, p(\mathfrak{W})) = -\int q \ln \frac{p(\mathfrak{W})}{q}
\tag{13}
$$

The unknown probability distribution $q$ is represented by $\delta(g - g_i)$,
where $\delta(\cdot)$ is the Dirac delta function placed at every data sample
$g_i$. Separating the $\ln$ term into $\ln(p(\mathfrak{W})) - \ln(q)$, we note
that

$$
\int \delta(g - g_i) \ln(p(\mathfrak{W})) = \ln(p(\mathfrak{W}(g_i)))
\tag{14}
$$

which is the log-likelihood that the random graph $\mathfrak{W}$ generates the
outcome $g_i$. Hence the likelihood (or, more precisely, its logarithm) can be
used naturally as a feature for classification in the dissimilarity/distance
representation framework. We also note that a feature space embedding of
graphs defined by likelihood values corresponds to the framework of Jaakkola
et al. [11], who propose to use kernels derived from generative models.

We thus summarize the scheme as follows. Given a dataset of graphs
representing patterns belonging to different classes, we first synthesize
random attributed graphs that act as a model/prototype for each class. For
each class, the largest graph (i.e. the graph with the maximum number of
nodes) is used to initialize the prototype. We then present every graph in the
training set, align it with the corresponding prototype and update the node
and edge occurrence (structural) probabilities. The means and covariances are
also updated according to Eqs. (8) and (11). Once the parameters of the random
prototype graphs are determined, we embed the dataset into a feature space by
calculating the log-likelihood between every graph in the dataset and every
element of the prototype set. We point out the following notable features of
this scheme: (1) More than one prototype could be used for every class,
especially for datasets with diverse graphs in the same class. However, in our
analysis and experiments we consider just one random prototype per class, in
view of the computational complexity of graph matching. (2) The size of the
prototypes is bounded by the size of the largest graph in the dataset. (3) The
number of graph matching operations during the parameter estimation stage
equals the size of the training set; once the prototype random graphs are
synthesized, the training set (with N samples) and the test set (with M
samples) have to be embedded in the likelihood space, which needs another
(N + M) x K graph matching operations, where K is the number of prototypes.
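
The embedding step and its matching cost can be sketched as follows. This is
an illustration only: `embed` and `CountingLikelihood` are hypothetical names,
and the toy likelihood (independent scalar Gaussian node attributes, no
structure) merely stands in for the real "align, then evaluate the
log-likelihood" computation so that the number of calls can be counted.

```python
import math


class CountingLikelihood:
    """Toy stand-in for 'align, then evaluate the likelihood'; counts calls."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prototype, graph):
        self.calls += 1  # one call corresponds to one graph matching operation
        mean, var = prototype
        return sum(-0.5 * (math.log(2.0 * math.pi * var) + (a - mean) ** 2 / var)
                   for a in graph)


def embed(graphs, prototypes, log_lik):
    """Map each graph to the vector of its log-likelihoods under all prototypes."""
    return [[log_lik(w, g) for w in prototypes] for g in graphs]


train = [[0.1, -0.2], [0.0, 0.3, 0.1]]  # N = 2 toy "graphs" (node attribute lists)
test = [[2.1, 1.9]]                      # M = 1
prototypes = [(0.0, 1.0), (2.0, 1.0)]    # K = 2 toy prototypes (mean, variance)
log_lik = CountingLikelihood()
features = embed(train + test, prototypes, log_lik)
```

Each row of `features` is the feature vector handed to the subsequent
classifier, and `log_lik.calls` equals (N + M) x K for this toy setup.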



4 Experiments 

4.1 Algorithmic details 

Matching attributed graphs: Both aligning a random graph with each sample
graph and calculating the likelihood involve attributed graph matching. We
adopt the graduated assignment algorithm [12] with a suitable compatibility
function for this purpose. This algorithm minimizes a cost function of the
match matrix over all possible matchings by an iterative procedure that
estimates the match matrix at every step and normalizes it. The matching
quality is influenced by the node compatibilities, which measure how similar
the nodes are structurally and in their attribute values. In determining the
morphism between random attributed graphs and outcome graphs, the
compatibility function is set to the likelihood of the node being structurally
present in the outcome, thus in effect finding the most probable morphism.
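
The "estimate, then normalize" step of the match matrix can be sketched as
below. This is a deliberately simplified illustration, not the graduated
assignment algorithm as used in the experiments: it uses node compatibilities
only (the edge terms of the objective and the annealing of the control
parameter are omitted), and the function name `softassign`, the fixed `beta`
and the iteration count are assumptions. A slack row and column absorb nodes
that remain unmatched.

```python
import math


def softassign(compat, beta=5.0, n_iter=50):
    """Soft match matrix from node compatibilities by alternating normalisation.

    compat[a][i]: compatibility of random-graph node a with outcome node i.
    Returns the match matrix without the slack row and column.
    """
    n_a, n_i = len(compat), len(compat[0])
    # Exponentiate the compatibilities; slack entries start at exp(0) = 1.
    m = [[math.exp(beta * compat[a][i]) for i in range(n_i)] + [1.0]
         for a in range(n_a)]
    m.append([1.0] * (n_i + 1))
    for _ in range(n_iter):
        for a in range(n_a):        # normalise real rows (slack column included)
            s = sum(m[a])
            m[a] = [v / s for v in m[a]]
        for i in range(n_i):        # normalise real columns (slack row included)
            s = sum(m[a][i] for a in range(n_a + 1))
            for a in range(n_a + 1):
                m[a][i] /= s
    return [row[:n_i] for row in m[:n_a]]


# Two nodes on each side; node 0 is compatible with 0, and node 1 with 1.
match = softassign([[1.0, 0.0], [0.0, 1.0]])
```

A hard morphism can then be read off, for example by taking the largest entry
in each row, with rows whose mass sits in the slack column left unmatched.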

Classification procedure: Once the random graphs had been synthesized
classwise, the dataset was embedded into a feature space by calculating the
log-likelihood of the graphs being generated by the prototype random graphs.
In the feature space, various classifiers were learned on the training set and
validated (by performance on the validation set or by cross-validation on the
training set). The classifier exhibiting the best validation performance was
used to classify the test data. Extensive experimentation indicated that
support vector machines with polynomial or Gaussian kernels yielded the best
performance. All classification experiments were done using the PyML software
[13].

4.2 Synthetic datasets



We first analyzed the performance of this algorithm on synthetic datasets. We
consider a dataset consisting of 200 graphs in the training and test sets,
belonging to two classes. The dataset is generated by considering distortions
of two base graphs, one per class, at different levels, viz. 5%, 10%, 15% and
20%. Node and edge attributes are generated according to a normal
distribution. Noise is then added according to the specified distortion level,
which modifies the node and edge occurrences and also their respective
attributes. The nodes are then randomly permuted. The dataset is then divided
uniformly into training and test sets. The classification scheme described in
this paper is referred to as RAG + LF (Random Attributed Graph model +
Likelihood as a Feature). The standard k-nearest neighbour algorithm (kNN) in
the graph domain is chosen as the benchmark classifier.

Figure 2: Classifier ROC plots for different distortion levels

The classifiers are evaluated on the basis of the area under the ROC curve
(AUC) [14]; the ROC curves are shown in Figure 2 (blue for RAG + LF, red for
kNN). The classification rates of kNN and of the proposed algorithm are
compared in Table 1. As can be seen, for low levels of distortion the
RAG + LF family of classifiers gives near-ideal performance. For higher noise
levels, the algorithm is more robust to noise than kNN.
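
For reference, the area under the ROC curve can be computed directly from
classifier scores as the fraction of (positive, negative) pairs that are
ranked correctly, with ties counted as one half. The sketch below is a
standard rank-based computation and is not taken from the paper; the function
name `auc` and the 0/1 label convention are assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking; ties count one half.

    scores: classifier outputs, larger meaning "more likely positive"
    labels: 1 for the positive class, 0 for the negative class
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

A value of 1 corresponds to a perfect ranking and 0.5 to chance level.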

Distortion (%)     5     10     15     20
RAG + LF          97     97     81     74
kNN               95     84     72     56

Table 1: Classification rates (%) on the synthetic datasets



4.3 Experiments on IAM graph database repository

A set of experiments was conducted on two standard datasets from the IAM
graph database repository [15]. A brief description of the datasets is
reproduced below.



Dataset          train, val, test    Classes    max V    max E
Letter (HIGH)    750, 750, 750       15         8        3.1
Fingerprint      500, 300, 2000      4          26       4.42

Table 2: Summary of main characteristics of the data sets



In order to examine the performance of the proposed approach on a two-class
problem consisting of patterns from morphologically distinct classes, a
reduced dataset called Fingerprint (AW) was created, consisting of patterns
belonging only to the classes arch and whorl.

4.4 Results and discussion 

The state-of-the-art techniques chosen for comparison are k-NN (chosen as the
reference system) [15], embedding based on similarity kernels (SK + SVM),
Lipschitz embedding (LE + SVM) [16], and structurally described random graphs
(SDRG). The approach proposed here is referred to as Random Attributed Graph
model + Likelihood as Feature (RAG + LF). RAG + ML denotes the method in which
a graph pattern is assigned to the class of the random prototype graph that
has the maximum likelihood of having generated it. SK + SVM and LE + SVM refer
to families of related classifiers, out of which the best performing model is
chosen; hence, the comparison is biased in their favour.



Method       Letter (HIGH)    Fingerprint (AW)    Fingerprint
kNN          82               91.8                76.6
SK + SVM     79.1             -                   41
LE + SVM     92.5             -                   82.8
SDRG         64.3             -                   -
RAG + ML     67.2             87.5                61.1
RAG + LF     75.7*,†          95.9                78.2*

Table 3: Classification rates (%) on the IAM Graph Dataset. * and † indicate
a statistically significant improvement of RAG + LF over RAG + ML and SDRG,
respectively, at significance level 0.05.

The following observations are made. The results compare well for the
Fingerprint dataset overall; for the Letter (HIGH) dataset, RAG + LF compares
well with SK + SVM and is superior to SDRG. Although k-NN yields good results
overall, it faces the computationally challenging task of choosing k. For
SK + SVM and LE + SVM, the task of choosing an effective prototype set and
calculating the graph-edit distance between the dataset and the prototype set
is expensive as well and offers no analytical insight. The approach presented
here is fast, as it involves estimating the parameters of the random graph
model analytically and needs far fewer graph matching operations,
corresponding to generating only one prototype model per class. The prototypes
also give a good summary of the node and edge occurrence probabilities in the
dataset and of the probability distributions of their attributes. Embedding
the dataset in the space spanned by the likelihood values offers a
statistically significant improvement with almost no loss of speed, as there
are fast packages for SVMs and other classification algorithms.



5 Conclusions 

This work builds upon the notion of random graph models, with applications in
structural pattern recognition, and makes the following contributions: (i)
under independence assumptions, a random attributed graph is represented as a
joint random variable over its node and edge occurrences and their respective
attribute values; (ii) an analytical method is presented, using a maximum
likelihood procedure, to estimate the different probability distributions of a
random graph model serving as a prototype for an ensemble of attributed
graphs; and (iii) the utility of the random graph as a prototype is shown by
using the likelihood of an outcome graph as a feature for classification. The
proposed approach is suited to contexts involving a large number of graph data
samples, as the determination of the random prototype graph is a density
estimation problem. It is robust to noise and faster than other approaches,
since fewer graph matching operations need to be performed.

There are several possible extensions to this approach. First, a method to
derive a class of probabilistic clustering and classification algorithms is
currently being investigated; that is, the random prototype graph is learned
from the dataset in a procedure akin to a standard quantization-type scheme.
Second, is there a way to tie the classifiers in the feature space directly to
the learning of prototypes? To elaborate, it is important to investigate the
link between the type or family of classifiers on the (likelihood) feature
space and how the random prototypes are estimated or learned. This would help
to integrate probabilistic learning in the domain of graphs with
discriminative methods for classification in the subsequent likelihood space.
Lastly, the foundations of the random graph definitions need to be explored:
although node and edge independence is useful in that it allows an easy
analytical estimation of model parameters, it is too strong an assumption. Is
there a way to model dependencies between nodes and edges and their attributes
(node/edge co-occurrences)? Such a model would help enormously in
probabilistic sub-structure analysis methods and possibly also yield superior
classification and clustering algorithms.